[RVM 2021] Robust High-Resolution Video Matting with Temporal Guidance: ByteDance; temporal modeling (ConvGRU); multi-task training

[SparseInst 2022] Sparse Instance Activation for Real-Time Instance Segmentation: CASIA (Institute of Automation); the overall design resembles DETR

Robust High-Resolution Video Matting with Temporal Guidance

  1. Motivation
    • human video matting
      • used for background replacement
      • existing methods are unstable and produce artifacts
    • performance goals
      • robust
      • real-time
        • 4K at 76 FPS and HD at 104 FPS on an Nvidia GTX 1080Ti
      • high-resolution
    • use a recurrent structure instead of frame-by-frame processing: a temporal network gives better matting quality
    • propose a novel training strategy: train the matting and segmentation tasks jointly, making the model more robust
  2. Arguments
    • matting formulation recap
      • $I = \alpha F + (1-\alpha)B$
    • matting methods
      • Trimap-based matting: the most classical setting; requires an extra trimap prior, and usually does no classification, only foreground/background separation
      • Background-based matting: no longer needs a trimap prior, but requires a background image as the prior instead
      • Segmentation: plain binary semantic segmentation; works reasonably on human foregrounds, but the background tends to show various artifacts, so it is unstable
      • Auxiliary-free matting: architectures that need no extra inputs; MODNet focuses on portraits, while this paper targets the whole person
      • Video matting:
        • MODNet uses predictions from neighboring frames to suppress each other's artifacts, but it is essentially still image-independent
        • BGM stacks multiple frames as extra input channels
      • Recurrent architecture: ConvLSTM/ConvGRU
      • High-resolution matting
        • Patch-based refinement: reduces the working resolution to save compute for the high-resolution task, then refines only selected patches
        • Deep guided filter: trainable, modular, upsamples the low-resolution result to high resolution end-to-end
    • use a temporal structure
      • temporal information boosts both quality and robustness
      • watching the background change over time lets the model learn background cues more robustly and precisely
    • introduce a new training strategy
      • most matting datasets are synthetic, and even the data pipeline composites foregrounds onto new backgrounds to enlarge the sample pool; such images look fake, creating a domain gap with real scenes and hurting generalization
      • some methods tackle the fake-image problem by pre-training on segmentation, adversarial training on real images, etc., but those are multi-step pipelines
      • training matting & segmentation jointly solves it in one shot, with no extra adaptation steps
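The compositing equation $I = \alpha F + (1-\alpha)B$ recapped above can be checked numerically; a toy numpy example (single-channel 2x2 "frame", made-up values):

```python
import numpy as np

# Toy 2x2 frame: alpha blends foreground F over background B per pixel.
alpha = np.array([[1.0, 0.5],
                  [0.25, 0.0]])   # alpha matte in [0, 1]
F = np.full((2, 2), 0.8)          # foreground intensity
B = np.full((2, 2), 0.2)          # background intensity

I = alpha * F + (1 - alpha) * B   # I = alpha*F + (1-alpha)*B
print(I)                          # fully-opaque pixel -> F, fully-transparent -> B
```

At alpha=1 the pixel is pure foreground (0.8), at alpha=0 pure background (0.2); matting is the inverse problem of recovering alpha (and F) from I.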
  3. Method
    • model architecture overview
      • encoder: encodes individual frames' features; MobileNetV3 / ResNet50
      • recurrent decoder: aggregates temporal information
      • a Deep Guided Filter module: high-resolution upsampling
    • Feature-Extraction Encoder
      • MobileNetV3-Large + LR-ASPP module
      • the last block uses dilated convolutions
    • Recurrent Decoder
      • ConvGRU at multiple scales
      • bottleneck block: at the 1/16 scale
        • placed after the LR-ASPP
        • then a ConvGRU with an identity path (channels are split: half pass through unchanged, half go through the GRU)
        • then bilinear 2x upsampling
      • Upsampling block: at the 1/8, 1/4, and 1/2 scales
        • one per resolution stage
        • first merge (concat) the previous stage's feature
        • then avg pooling and conv-bn-relu to transform the feature
        • then a ConvGRU with an identity path
        • then bilinear 2x upsampling
      • Output block: at the full resolution
        • makes the final prediction
        • first merge
        • then [conv3x3-bn-relu] x2
        • then conv1x1 heads: 1-channel alpha / 3-channel foreground / 1-channel segmentation
    • Deep Guided Filter Module
      • for high-resolution videos such as 4K and HD
      • first downsample the input by a factor s
      • then run it through the network
      • finally, four pieces of information, namely the network's two outputs (alpha & foreground), the output block's hidden features, and the high-resolution original frame, are all fed to the DGF to produce the high-resolution alpha and foreground
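The split-channel ConvGRU update used throughout the decoder can be sketched in numpy. This is a minimal single-timestep toy, not the paper's implementation: 1x1 channel-mixing stands in for the real convolutions, weights are random, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

def conv1x1(x, w):
    # Stand-in for the decoder's convs: a 1x1 conv is just channel mixing.
    return np.einsum('chw,oc->ohw', x, w)

def convgru_step(x, h, Wz, Wr, Wh):
    # Single ConvGRU update: standard GRU gates with convs instead of dense layers.
    xh = np.concatenate([x, h], axis=0)
    z = sigmoid(conv1x1(xh, Wz))                                   # update gate
    r = sigmoid(conv1x1(xh, Wr))                                   # reset gate
    cand = np.tanh(conv1x1(np.concatenate([x, r * h], axis=0), Wh))
    return (1 - z) * h + z * cand                                  # new hidden state

C, H, W = 8, 4, 4                        # toy channel/spatial sizes
x = rng.standard_normal((C, H, W))       # feature map entering the block
h = np.zeros((C // 2, H, W))             # recurrent state: half the channels

# Identity-path split as in the decoder: the first half of the channels
# bypasses the GRU unchanged, the second half is updated recurrently,
# and the two halves are re-concatenated afterwards.
x_id, x_gru = x[:C // 2], x[C // 2:]
Wz = rng.standard_normal((C // 2, C)) * 0.1
Wr = rng.standard_normal((C // 2, C)) * 0.1
Wh = rng.standard_normal((C // 2, C)) * 0.1
h = convgru_step(x_gru, h, Wz, Wr, Wh)
out = np.concatenate([x_id, h], axis=0)  # [C, H, W], then bilinear 2x upsampling
print(out.shape)
```

The split keeps half the channels purely feed-forward (cheap, no state), while the GRU half carries temporal information across frames.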
  4. Experiments
    • training details
      • progressive learning: the model gradually sees longer sequences and higher resolutions
      • losses:
        • matting loss (alpha / fg): L1 & pyramid Laplacian loss, plus an additional temporal coherence loss
        • segmentation loss: BCE
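The matting loss terms listed above can be sketched in numpy. A toy version with only the L1 and temporal-coherence pieces (the Laplacian pyramid term is omitted for brevity; shapes, data, and the exact coherence formulation here are illustrative, not the paper's):

```python
import numpy as np

def l1(a, b):
    return np.abs(a - b).mean()

def temporal_coherence(pred, gt):
    # Penalize flicker: make the frame-to-frame change of the prediction
    # match that of the ground truth (an L2 penalty on alpha differences).
    dp = np.diff(pred, axis=0)   # [T-1, H, W] temporal differences
    dg = np.diff(gt, axis=0)
    return ((dp - dg) ** 2).mean()

T, H, W = 4, 8, 8                # toy clip: 4 frames of 8x8 alpha mattes
rng = np.random.default_rng(1)
alpha_pred = rng.random((T, H, W))
alpha_gt = rng.random((T, H, W))

loss = l1(alpha_pred, alpha_gt) + temporal_coherence(alpha_pred, alpha_gt)
print(float(loss))
```

The coherence term is zero for any prediction whose per-frame changes exactly track the ground truth, which is what discourages frame-to-frame flicker.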

Sparse Instance Activation for Real-Time Instance Segmentation

  1. Motivation
    • fully convolutional real-time instance segmentation
    • prior instance segmentation is usually tied to object detection
      • dense anchors
      • receptive fields fixed by the fixed anchors
      • multi-level prediction
      • ROI-Align is unfriendly to mobile/embedded devices
      • NMS is time-consuming
    • this paper
      • a sparse set of activation maps: analogous to DETR's 100 proposals
      • instance-level features obtained from the attention maps
      • Hungarian matching between proposed instances and ground truth, which removes the need for NMS and yields sparse predictions
      • 40 FPS and 37.9 AP on the COCO benchmark
      • repo: https://github.com/hustvl/SparseInst
  2. Arguments
    • this paper
      • IAM: instance activation maps, a sparse set, motivated by CAM
        • pixel-level: in contrast to boxes, which still contain background
        • global context & single-level
        • simple ops: avoids the unavoidable loops of ROI-Align/NMS
        • sparse bipartite supervision: inhibits the redundant predictions, thus avoiding NMS
      • recognition and segmentation: the downstream tasks run on top of IAM's instance features
    • overall structure
      • encoder: backbone + PPM, giving 1/8-scale fused features
      • decoder: multi-branch
        • instance branch: IAM
        • mask branch: semantic-segmentation-style mask features
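The bipartite (Hungarian-style) matching that replaces NMS can be illustrated with a tiny brute-force matcher. This is a stand-in sketch: the cost values below are made up (in SparseInst the entries would combine classification and mask/dice costs), and real implementations use an O(n^3) solver rather than enumerating permutations:

```python
import itertools
import numpy as np

def bipartite_match(cost):
    # Brute-force one-to-one assignment minimizing total cost; only viable
    # for tiny matrices, but it shows the objective Hungarian matching solves.
    n_pred, n_gt = cost.shape
    best, best_perm = float('inf'), None
    for perm in itertools.permutations(range(n_pred), n_gt):
        total = sum(cost[p, g] for g, p in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best

# Rows = predicted instances, columns = ground-truth instances.
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.6],
                 [0.5, 0.4, 0.05]])
perm, total = bipartite_match(cost)
print(perm, total)   # perm[g] = index of the prediction assigned to gt g
```

Because each ground-truth instance is matched to exactly one prediction, every unmatched prediction is simply supervised toward "no object", so duplicates are suppressed during training and no NMS is needed at inference.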
  3. Method
    • IAM: Instance Activation Maps
      • basic assumption: the features produced by the encoder are redundant
      • IAM ops
        • an identity branch passing through the raw features, [b, h, w, d]
        • a feature-selection branch (conv + sigmoid + normalization), [b, h, w, N]
        • a matrix multiplication between the two branches, [b, N, d]: the selection branch provides N forms of spatial reweighting over the raw features, which serve as the final attention proposals
      • downstream tasks: recognition and segmentation, predicting per instance
        • kernel
        • class
        • score
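The IAM op above boils down to sigmoid-normalized attention maps multiplied against the feature map. A minimal numpy sketch (all sizes, the random inputs, and the exact normalization choice are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
b, h, w, d, N = 2, 16, 16, 32, 8           # toy sizes; N = number of proposals

feat = rng.standard_normal((b, h, w, d))   # encoder features (identity branch)
logits = rng.standard_normal((b, h, w, N)) # conv output of the selection branch

# Sigmoid activation, then normalize each map over the spatial positions
# so every proposal is a spatial weighting that sums to 1.
A = 1 / (1 + np.exp(-logits))
A = A / A.sum(axis=(1, 2), keepdims=True)

# Each of the N maps spatially reweights the features -> instance features.
A_flat = A.reshape(b, h * w, N)                   # [b, hw, N]
F_flat = feat.reshape(b, h * w, d)                # [b, hw, d]
inst = np.einsum('bpn,bpd->bnd', A_flat, F_flat)  # [b, N, d]
print(inst.shape)
# Downstream linear heads then predict class / score / mask kernel per row.
```

Each row of `inst` is one proposal's feature vector, i.e. a weighted average of pixel features under that proposal's activation map, with no ROI cropping involved.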